
Restructure Cosmos benchmark agent to 3 deterministic skills#48165

Draft
xinlian12 wants to merge 23 commits into Azure:main from xinlian12:cosmos-benchmark-agent

Conversation


@xinlian12 xinlian12 commented Feb 27, 2026

Cosmos Benchmark Agent

Overview

Adds a Copilot-powered Cosmos DB benchmark agent that automates the full benchmark lifecycle: provisioning infrastructure, running benchmarks, and analyzing results. The agent is organized into 3 deterministic, script-driven skills with a clear sequential workflow:

setup-resources → run → analyze

How to Use (Copilot CLI)

1. Select the agent

From the Copilot CLI, use @ to select the cosmos-benchmark agent:

$ @cosmos-benchmark setup resources for a benchmark run in West US 2

Alternatively, start a session while working under sdk/cosmos/azure-cosmos-benchmark/, and the agent is auto-selected based on context.

2. Example workflows

Full benchmark — provision, run, analyze:

You: setup resources with 50 cosmos accounts in West US 2
Agent: ✅ Created 50 accounts, 1 VM, exported config to ~/dev/benchmark-config

You: run benchmark on origin/main and xinlian12/wireConnectionSharingInBenchmark, simple preset, 10 min
Agent: ✅ Launched. Tmux running on VM. Polling...
Agent: ✅ origin/main completed. Starting next ref...
Agent: ✅ Both refs completed.

You: analyze results
Agent: 📊 Downloaded results. Generating comparison report...

Quick validation of a PR:

You: run benchmark on PR#12345 vs main, simple preset

Check on a running benchmark:

You: peek
Agent: ✅ Tmux running. 15 monitor samples. Threads: 358, Heap: 4GB/8GB, CPU: 18.7%

Reuse existing infrastructure:

You: setup resources, reuse existing cosmos accounts in rg-benchmark-west, VM at 20.98.84.14

3. Key commands

| What you say | What happens |
| --- | --- |
| `setup resources` | Provisions Cosmos accounts, App Insights, VM |
| `run benchmark on <refs>` | Builds JAR, generates tenants.json, runs on VM in tmux |
| `peek` / `check status` | Shows tmux state, monitor metrics, results status |
| `analyze results` | Downloads from VM, generates comparison report |
| `capture diagnostics` | Takes thread/heap dump of running benchmark |

Agent Structure

azure-cosmos-benchmark/
└── copilot/
    ├── agents/
    │   └── cosmos-benchmark.agent.md          # Routing table → 3 skills
    └── skills/
        ├── cosmos-benchmark-setup-resources/  # Step 1: Azure infrastructure
        ├── cosmos-benchmark-run/              # Step 2: Build & execute
        ├── cosmos-benchmark-analyze/          # Step 3: Download & report
        └── skill-creator/                     # Meta: skill authoring guide

Skills

1. setup-resources — Provision Azure Infrastructure

Creates Cosmos DB accounts, Application Insights, and Azure VMs. Exports credentials to a config directory consumed by downstream skills.

Script flow:

provision-all.sh                              # Entrypoint: orchestrates all provisioning
│
├── [1/5] validate-capacity.sh                # Pre-flight: check region has VM SKU + Cosmos capacity
│         └── Outputs capacity-check.json     #   Blocks provisioning if checks fail
│
├── [2/5] az group create                     # Create resource group
│
├── [3/5] Parallel resource creation ─────────────────────────────────────
│   ├── create-cosmos-accounts.sh  (bg)       # Creates N Cosmos DB accounts in parallel
│   ├── az monitor app-insights ... (bg)      # Creates Application Insights
│   └── provision-benchmark-vm.sh  (bg)       # Creates VM, installs JDK 21 + Maven 3.9
│       └── SSH → apt install, download JDK   #   Writes vm-ip, vm-user, vm-key to config-dir
│                                             # Waits for all 3 background jobs
│
├── [4/5] export-cosmos-credentials.sh        # Fetches account keys → clientHostAndKey.txt
│
└── [5/5] verify-resources.sh                 # Health check: SSH to VM, test Cosmos connectivity
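The parallel step [3/5] can be sketched as the standard bash background-job pattern: launch each creation step with `&`, record its PID, then `wait` on all of them and fail if any job failed. This is a minimal sketch, not the real provision-all.sh; the function bodies are stand-ins for the scripts named in the tree above.

```shell
#!/usr/bin/env bash
# Sketch of the parallel-provisioning pattern in provision-all.sh step [3/5].
# Each function is a stand-in for the corresponding script/command above.
set -euo pipefail

create_cosmos_accounts() { sleep 1; }   # stand-in for create-cosmos-accounts.sh
create_app_insights()    { sleep 1; }   # stand-in for az monitor app-insights ...
provision_vm()           { sleep 1; }   # stand-in for provision-benchmark-vm.sh

pids=()
create_cosmos_accounts & pids+=($!)
create_app_insights    & pids+=($!)
provision_vm           & pids+=($!)

status=0
for pid in "${pids[@]}"; do
  wait "$pid" || status=1               # collect the exit code of every job
done
[ "$status" -eq 0 ] && echo "all provisioning jobs succeeded"
```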

Config directory outputs (consumed by run skill):

<config-dir>/
├── vm-ip, vm-user, vm-key               # VM SSH connection info
├── vm-config.env                         # VM_IP, VM_USER, VM_KEY_PATH
├── clientHostAndKey.txt                  # Cosmos account endpoints + keys
├── app-insights-connection-string.txt    # Application Insights connection string
└── logs/                                 # Per-resource provisioning logs
    ├── capacity-check.json
    ├── cosmos-accounts.log
    ├── app-insights.log
    ├── vm.log
    └── export-credentials.log
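The handoff to the run skill works by sourcing these files. A minimal sketch of how a downstream script might consume vm-config.env (the variable names come from the listing above; the temp directory and ssh options are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of the config-directory handoff: source vm-config.env and build
# the SSH invocation from it. The config-dir here is a throwaway stand-in.
set -euo pipefail

config_dir=$(mktemp -d)                 # stand-in for the real <config-dir>
cat > "$config_dir/vm-config.env" <<'EOF'
VM_IP=20.98.84.14
VM_USER=azureuser
VM_KEY_PATH=/home/azureuser/.ssh/bench_key
EOF

# shellcheck source=/dev/null
source "$config_dir/vm-config.env"

ssh_cmd="ssh -i $VM_KEY_PATH $VM_USER@$VM_IP"
echo "$ssh_cmd"
```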

2. run — Build & Execute Benchmarks

Clones repo at specified branch/PR/commit, builds the benchmark JAR, and executes scenarios on the VM inside a tmux session for resilience against SSH disconnections.

Script flow:

generate-tenants.sh                       # Generates tenants.json from config-dir credentials
  └── SCPs tenants.json to VM

run-all-refs.sh                           # Entrypoint: orchestrates N refs sequentially
│   (for each ref)
│   ├── SCP vm-prepare-and-run.sh → VM    # Copy bootstrapper to VM
│   ├── tmux new-session                  # Start tmux on VM (survives SSH drops)
│   │   └── vm-prepare-and-run.sh         # Runs ON the VM inside tmux
│   │       ├── git checkout <ref>        # Auto-detects branch/PR/commit/tag/fork
│   │       ├── mvn install (linting-extensions + benchmark JAR)
│   │       ├── Verify readiness (JDK, JAR, tenants.json, disk)
│   │       └── run-benchmark.sh          # Launches java benchmark process
│   │           ├── java -cp benchmark.jar Main -tenantsFile tenants.json ...
│   │           └── monitor.sh            # External JVM monitoring (threads, heap, FDs, GC)
│   └── Poll tmux until complete
└── Print summary (✅/❌ per ref)

check-status.sh                           # Standalone: check VM state anytime
  └── SSH → tmux status, results dirs, git state, build status, system resources

capture-diagnostics.sh                    # Standalone: capture thread/heap dumps mid-run
  └── SSH → jstack, jmap, JFR on running benchmark PID

Supports: multiple refs for comparison (main vs feature branch), scenario presets (SIMPLE ~30 min, EXPAND ~90 min, CHURN for leak detection), --force-copy-scripts to test local script changes.
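The launch-then-poll control flow above can be sketched as follows. This is an illustrative skeleton, not the real run-all-refs.sh: the SSH/tmux calls are shown as comments and replaced with local stand-ins so the loop is runnable anywhere.

```shell
#!/usr/bin/env bash
# Sketch of run-all-refs.sh's per-ref flow: launch detached, poll until the
# session exits, then read back an exit-code file.
set -euo pipefail

exit_code_file=$(mktemp)

launch_in_tmux() {
  # real flow (roughly): ssh "$VM" "tmux new-session -d -s bench \
  #   'bash ~/benchmark-scripts/vm-prepare-and-run.sh; echo \$? > /tmp/exit'"
  ( sleep 1; echo 0 > "$exit_code_file" ) &
}

session_alive() {
  # real flow (roughly): ssh "$VM" "tmux has-session -t bench"
  [ ! -s "$exit_code_file" ]
}

launch_in_tmux
while session_alive; do
  sleep 1                               # real flow polls every 2-5 min per preset
done
ref_exit=$(cat "$exit_code_file")
echo "benchmark exited with code $ref_exit"
```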

3. analyze — Download & Report Results

Downloads results from the VM and generates comparison reports with pass/fail thresholds.

Script flow:

download-results.sh                       # SCP results from VM → local
  └── results/<run-name>/
      ├── monitor.csv                     # External JVM metrics (threads, heap, FDs, GC)
      ├── metrics/                        # Codahale CSV metrics (latency, throughput)
      ├── gc.log                          # G1GC log
      ├── git-info.json                   # Branch, commit SHA
      └── heap-dumps/                     # If OOM or manually triggered

generate-report.py                        # Generates markdown report
  ├── Parse monitor.csv → time-series charts (thread count, heap, FDs)
  ├── Parse metrics/ → latency percentiles, throughput tables
  ├── Compare runs → side-by-side delta tables
  └── Apply thresholds → PASS/FAIL per metric (from references/thresholds.md)
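The threshold step can be illustrated with a one-liner over a monitor.csv-style file. This is an awk sketch, not generate-report.py; the column layout and the 400-thread limit are made up for illustration (real thresholds live in references/thresholds.md).

```shell
#!/usr/bin/env bash
# Sketch of a PASS/FAIL threshold check on peak thread count from a
# monitor.csv-style file. Columns and limit are illustrative.
set -euo pipefail

csv=$(mktemp)
cat > "$csv" <<'EOF'
timestamp,threads,heap_mb
1,320,4096
2,358,4300
3,341,4200
EOF

verdict=$(awk -F, 'NR > 1 && $2 > max { max = $2 }
                   END { print (max <= 400 ? "PASS" : "FAIL") " peak_threads=" max }' "$csv")
echo "$verdict"
```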

Benchmark Modes

The framework supports two modes — purely a configuration choice:

  • Single-tenant: Pass connection details directly via CLI flags
  • Multi-tenant: Pass -tenantsFile tenants.json with multiple account configurations

Both use the same JAR, orchestrator, and monitoring infrastructure.
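For multi-tenant mode, tenants.json is generated by generate-tenants.sh from the config-dir credentials. A hedged sketch of what such a file might look like (the field names here are illustrative, not the actual schema; the repo ships a sample template without real credentials):

```json
[
  {
    "endpoint": "https://bench-acct-001.documents.azure.com:443/",
    "key": "<redacted>",
    "databaseName": "benchdb"
  },
  {
    "endpoint": "https://bench-acct-002.documents.azure.com:443/",
    "key": "<redacted>",
    "databaseName": "benchdb"
  }
]
```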

Key Design Decisions

  • Scripts over code: All infrastructure and orchestration logic is in bash scripts, making it easy to run manually or debug
  • tmux resilience: Benchmarks run in tmux sessions on the VM, surviving SSH disconnections
  • Config directory pattern: Each skill reads/writes to a shared config directory, enabling clean handoff between steps
  • Force-copy-scripts flag: --force-copy-scripts overrides repo scripts with local versions for testing changes before they are merged
  • Mandatory post-launch verification: After launching, the agent must run check-status.sh to verify the benchmark is actually running before reporting success to the user
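The post-launch verification decision can be sketched as a retry loop with a 90-second budget. This is a stand-in skeleton: `check_status` here simulates check-status.sh over SSH, and the timings are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of mandatory post-launch verification: probe status repeatedly
# within a 90s deadline before reporting success.
set -euo pipefail

state_file=$(mktemp)
( sleep 2; echo running > "$state_file" ) &      # stand-in: run comes up shortly after launch

check_status() { grep -q running "$state_file"; }  # stand-in for check-status.sh

deadline=$(( $(date +%s) + 90 ))
verified=no
until check_status; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "launch NOT verified -- investigate before reporting success" >&2
    exit 1
  fi
  sleep 1
done
verified=yes
echo "benchmark verified running"
```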

Additional Changes

  • Adds skill-creator skill: a meta-skill for authoring new skills with proper structure and conventions

Annie Liang and others added 2 commits February 27, 2026 11:40
Add benchmark shell scripts for VM provisioning, setup, execution,
monitoring, diagnostics capture, and dashboard generation.

Update BenchmarkConfig, BenchmarkOrchestrator, and TenantWorkloadConfig
to support multi-tenant benchmark orchestration with per-tenant
configuration overrides.

Add .gitignore entries for benchmark artifacts and Copilot skills.
Add test-setup and test-results directory scaffolding with READMEs
and a sample tenants.json template (no real credentials).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Agent routing file dispatches to 5 skills covering the full
benchmark/DR drill lifecycle:

- provision: Cosmos DB accounts, App Insights, Azure VMs
- setup: JDK/Maven install, repo clone, config generation, build
- run: CHURN preset execution, multi-VM parallel, App Insights config
- analyze: CSV metrics, run comparison, heap/thread dumps, Kusto export
- status: resource health, run overview, App Insights verification

Also includes skill-creator utility for authoring new skills.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Annie Liang and others added 2 commits February 27, 2026 14:07
…ate runtime config

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 force-pushed the cosmos-benchmark-agent branch from bf46a2f to c43f5b6 on February 27, 2026 22:50
Consolidate the benchmark agent from 5 skills down to 3, with deterministic
script-driven flows replacing inline commands.

Skills:
- setup-resources: provision Azure infra (Cosmos DB, App Insights, VM) with
  parallel creation, capacity validation, region fallback, and verification gate
- run: clone/build/verify/execute benchmarks via single SSH session per ref,
  supports multiple refs for comparison, SIMPLE/EXPAND/CHURN presets
- analyze: download results to config-dir/results, generate markdown report
  with time-series SVG charts and multi-run comparison tables

Key changes:
- Rename provision -> setup-resources, merge setup into run, remove status
- .github/skills and .github/agents use symlinks to copilot/ (single source)
- Default region westus2, resource group rg-cosmos-benchmark-YYYYMMDD
- Config directory prompted with credential-in-repo warning
- provision-all.sh orchestrates parallel resource creation + verification
- vm-prepare-and-run.sh consolidates checkout/build/verify/run in 1 SSH session
- run-all-refs.sh loops over user-provided refs with per-ref result directories
- generate-report.py reads monitor.csv + metrics/*.csv, outputs report.md
- Remove parse_hprof.py, kusto-schema.md, generate-dashboard.py (deferred)
- Remove trigger-benchmark.sh (superseded by vm-prepare-and-run.sh)
- Merge setup-benchmark-vm.sh into provision-benchmark-vm.sh

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 changed the title from [Cosmos Benchmark]AddBenchmarkAgentAndSkills to Restructure Cosmos benchmark agent to 3 deterministic skills on Mar 2, 2026
Annie Liang and others added 18 commits March 2, 2026 15:00
- Add timestamped progress logging to validate-capacity.sh
- Fix restriction detection to handle all types (Zone, NotAvailableForSubscription)
- Replace slow per-SKU API calls with single-call alternative SKU search
- Add --find-alternatives flag to control similar SKU search
- Add restriction_reason field to JSON output
- Derive quota family dynamically from effective SKU

- Add --fallback-regions flag to find-region.sh for user-specified regions
- Implement 4-phase search: preferred exact → preferred similar → fallback exact → fallback similar
- Add [N/M] progress updates printed as each region completes
- Add --stop-on-first flag (default: true)
- Fix integration bugs: JSON path, exit code logic, stdin-based parsing

- Update SKILL.md to document new flags and search strategy

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add capacity validation step before resource creation that blocks
  unless all checks pass (VM SKU, quota, Cosmos DB, App Insights)
- Add --skip-capacity-check flag to override the gate
- Add timestamped log() function for all progress messages
- Add elapsed time tracking per resource and total provisioning time
- Fix JSON parsing to match validate-capacity.sh output format
- Update SKILL.md to document new behavior and flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wrap benchmark execution in tmux session ('bench') on VM so the
  process survives SSH disconnections
- Add async execution guidance to SKILL.md so the agent runs the
  orchestrator in background mode, keeping the user's context free
- Use scenario-based poll intervals (2min for SIMPLE, 5min for
  EXPAND/CHURN) instead of 10s fixed polling
- Expand monitoring section with local and VM-side status checks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Detect refs like 'xinlian12/branchName' by checking if the part
  before the first slash matches an existing git remote
- If remote exists, fetch from that remote; otherwise treat the
  slash as part of the branch name on origin
- Document fork branch format in SKILL.md ref examples

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
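The detection logic in this commit can be sketched as a small helper: treat the text before the first `/` as a remote name only when `git remote` lists it. This is an illustrative function, not the exact code from vm-prepare-and-run.sh.

```shell
#!/usr/bin/env bash
# Sketch of fork-branch detection: 'xinlian12/branchName' resolves to
# remote 'xinlian12' only if that remote exists; otherwise the slash is
# part of a branch name on origin.
set -euo pipefail

resolve_ref() {
  local ref=$1 prefix=${1%%/*}
  if [ "$prefix" != "$ref" ] && git remote | grep -qx "$prefix"; then
    echo "remote=$prefix branch=${ref#*/}"
  else
    echo "remote=origin branch=$ref"
  fi
}
# e.g., in a repo with remote 'xinlian12':
#   resolve_ref xinlian12/wireConnectionSharingInBenchmark
#   resolve_ref feature/foo   # no 'feature' remote -> branch on origin
```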
- Instruct agent to proactively verify the run is progressing after
  async launch — if the shell exits too quickly, investigate
- Add diagnosis steps: check results dirs, git state, JAR, tmux
- Document common failures table (checkout, build, startup, SSH)
- Require confirming with user before relaunching after a failure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- New script checks tmux session, results directories (with per-run
  status), git state, build status, and optionally system resources
- Supports --run-name for run-specific details (monitor samples,
  metrics, disk usage) and --verbose for system resource info
- Updated SKILL.md to reference check-status.sh in monitoring and
  troubleshooting sections
- Fix SSH stdin consumption in while-read loop with -n flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- SCP vm-prepare-and-run.sh, run-benchmark.sh, monitor.sh, and
  capture-diagnostics.sh to ~/benchmark-scripts/ on the VM
- Execute remotely via 'bash ~/benchmark-scripts/vm-prepare-and-run.sh'
  instead of 'bash -s' stdin piping which broke heredocs
- Update vm-prepare-and-run.sh to reference co-located scripts from
  ~/benchmark-scripts/ in the tmux run script

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh now only SCPs vm-prepare-and-run.sh (the bootstrapper)
  instead of all 4 scripts
- After checkout, vm-prepare-and-run.sh resolves scripts from the
  cloned repo (copilot/skills/.../scripts/) so they match the ref
  being benchmarked
- Falls back to ~/benchmark-scripts/ if the repo doesn't include
  the scripts yet (e.g., older branches)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh: --force-copy-scripts copies ALL scripts to VM
  (not just the bootstrapper) and passes --force-scripts to the
  bootstrapper
- vm-prepare-and-run.sh: --force-scripts overrides repo-first
  resolution, using ~/benchmark-scripts/ (the SCP'd copies) instead
- Default behavior unchanged: repo scripts used after checkout

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- run-all-refs.sh now starts vm-prepare-and-run.sh inside a tmux
  session, so checkout, build, verify AND run all survive SSH
  disconnection
- vm-prepare-and-run.sh Step 4 simplified: runs run-benchmark.sh
  directly (no nested tmux, no .run.sh heredoc generation)
- Polling and exit code logic moved to run-all-refs.sh orchestrator

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Write a small /tmp/bench-launch.sh on the VM that wraps
  vm-prepare-and-run.sh and writes the exit code
- Avoids nested quoting issues (SSH -> tmux -> bash -> args)
- Fix stale EXIT_CODE_FILE variable reference

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use $HOME instead of ~ in double-quoted string to ensure correct
path expansion when interpolated into SSH commands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
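The quoting pitfall behind this fix, in miniature: tilde expansion does not happen inside double quotes, so a remote path built as "~/benchmark-scripts" ships a literal tilde through the SSH command string, while "$HOME/benchmark-scripts" expands locally before interpolation.

```shell
#!/usr/bin/env bash
# Demonstrates why $HOME is used instead of ~ inside double-quoted strings
# that get interpolated into SSH commands.
set -euo pipefail

literal="~/benchmark-scripts"        # tilde is NOT expanded inside double quotes
expanded="$HOME/benchmark-scripts"   # expands now, before interpolation

echo "$literal"
echo "$expanded"
```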
When run-benchmark.sh is executed from ~/benchmark-scripts/ (SCP'd
copy), SCRIPT_DIR/../ doesn't point to the benchmark module. Fall
back to PWD if the script's parent doesn't contain a target/ dir.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…gDirectory

- --tenantsFile -> -tenantsFile (JCommander uses single dash)
- Remove --scenario and --outputDir (not valid Configuration params)
- Add -reportingDirectory for CSV metrics output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace fire-and-forget async launch with a two-step workflow:
Step A: Launch orchestrator with sync mode (initial_wait: 60)
Step B: Mandatory verify via check-status.sh within 90s

Prevents the agent from telling the user 'it's running' without
actually confirming tmux is alive and results directory exists.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously, createBenchmarks() initialized Cosmos clients sequentially
in a for loop. With 50 tenants, each taking ~10-15s (connect + create
DB/container + populate docs), initialization alone took ~8-10 minutes.

Now submits all tenant initializations to the existing ExecutorService
in parallel, collecting results via Future.get(). With 50 tenants on
a 50-thread pool, initialization completes in ~15-20s instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>